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Abstract 



< 

Transfer entropy, an information-theoretic measure of time-directed information trans- 
it for between joint processes, has steadily gained popularity in the analysis of complex 
stochastic dynamics in diverse fields, including the neurosciences, ecology, climatology 
and econometrics. We show that for a broad class of predictive models, the log-likelihood 
ratio test statistic for the null hypothesis of zero transfer entropy is a consistent estimator 
for the transfer entropy itself. For finite Markov chains, furthermore, no explicit model is 
required. In the general case, an asymptotic \ 2 distribution is established for the transfer 
entropy estimator. The result generalises the equivalence in the Gaussian case of transfer 
entropy and Granger causality, a statistical notion of causal influence based on predic- 
tion via vector autoregression, and establishes a fundamental connection between directed 
information transfer and causality in the Wiener-Granger sense. 

Transfer entropy (TE) was formulated by Schreiber [25] as a non-parametric measure of 
directed (time-asymmetric) information transfer between joint processes. It has since rapidly 
• i-H gained popularity, particularly in neuroscience [14, 10, 28, 29, 20], as a tool for data-driven 

detection of functional coupling between joint processes. In [3] it was shown that for Gaussian 
vector autoregressive (VAR) processes, transfer entropy is equivalent to Granger causality 
[31, 11, 8]. Noting that the Granger causality statistic may be formalised as a log-likelihood 
ratio, this Letter extends the result in [3] (see also [1, 26]) to a very general class of continuous 
or discrete Markov models in a maximum likelihood framework. In the case of a finite state 
space, moreover, no explicit model is required since the result applies to the conventional 
plug-in estimator for TE. 

The result is of particular significance since estimation of TE in sample — particularly 
for continuous systems — is notoriously awkward [25, 16, 24]; furthermore, little has been 
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known previously about its sampling properties. The likelihood formulation presented here 
provides a convenient route to estimation and statistical inference of TE in a broad parametric 
context, and on a conceptual level pinpoints the relationship between information-theoretic 
directed information transfer in the Schreiber sense and predictive, model-based causality in 
the Wiener- Granger sense. 

Firstly we introduce some notation 1 : for a time-indexed sequence x = (xt), t = 1,2,... 
we write x l = {x\, x%, . . . , xt) for the subsequence up to time t, and x k = (xt-ki ■ ■ ■ > x t-i) for 
the £>lag history of the sequence up to time t—1. Suppose now that (X, Y) = (X t , Y t ) are 
jointly stationary stochastic processes taking values in the state space Sx x Sy. (Later we 
shall also require an ergodicity condition on the joint process.) We then have the following 
expressions for the entropy of Xt conditional on the joint (fc-lag) history of X and Y, and on 
the history of X only: 

H(x t \X k ,Y k ) = -v(\ogp{X t \XtY t k )) 
H(x 4 |X t fc ) =-E(logp(X t |X t fc )) 

(assuming that the expectations are < oo) where p(xt\x k ,y k ), p(xt\x k ) are respectively the 
marginal probability density functions (pdfs) of Xt conditioned on the history of X and Y, 
and on the history of X only. By stationarity, these densities do not depend on t. The (k-l&g) 
transfer entropy from Y — > X is then defined by: 

Ty^x ^n(x t \x k ) -n(x t \x k ,Y k ) , (2) 

which may be read as "the degree to which the history of Y disambiguates current X beyond 
the degree to which X is already disambiguated by its own history" . 

Suppose now given a parametrised Markov (predictive) model for Xt in terms of the 
history of X and Y: 

p(x t \x t -\y t - 1 ;0) = f(x t \x k ,y k ;O), (3) 

where = (9i, . . . , 9 m ) is a vector of parameters in a (Euclidean) parameter space 0, /(•(•, •; 6) 
is a conditional probability density function from Sx x S x x5y -> [0, 1] and p(xt\x t ~ 1 ,y t ~ 1 ; 6) 
is the marginal pdf of Xt conditioned on the entire history of X and Y under the model 
assumption with parameter vector 9. Eq. (3) describes a "partial" model, insofar as only the 
marginal conditional distribution of X is specified. We assume that the model is identifiable 
and well- specified: that is, 0\ ^ O2 => /(-|-,-;^i) 7^ /("I")'! ^2), an d there is a unique true 
parameter vector 0* satisfying 

pixtlx*- 1 , y 1 ' 1 ) = f(xt\x k , y\\ 0*) = p{x t \x k , y k ) . (4) 

We do not demand that the joint process (X, Y) satisfy a Markov property analogous to (4). 

To apply likelihood methods we require a "full" rather than a partial, model 2 . However, 
we are not really interested in modelling Y, but just in how its history "drives" X. As 
a mathematical device, then, we extend the partial model (somewhat arbitrarily) to a full 
model, for which the marginal distribution of Xt conditional on joint history agrees with (3). 

In what follows upper case symbols denote random variables and bold typeface indicates vector quantities. 
2 An alternative approach would be to regard the yt as nuisance parameters; however, we believe this 
complicates the analysis. 
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Let q(y) be the marginal (unconditional) pdf of Y t . We define the extended model (again 
parametrised by 9 E 0) by 3 

p{x t ,y t \x t - x ,y t - 1 ;e) = f(x t \x*,y*;0)q(y t ), (5) 

where p(xt, ?/i|£c i_1 , y l ~ l ; 9) is the conditional joint pdf of X t , Y t on their joint history under 
the model assumption with parameter vector 9. Eq. (5) may be interpreted as "the Yj are 
independent of Xt conditioned on joint history". Note that in fact this condition will in 
general not actually hold for the joint process (X , Y); i.e. in general there will be no 9 S G 
such that p(xt,yt\x t ~ l ,y t ~ 1 ) = f(xt\x k ,y k ;9)q(y t ) [cf. (4)]. In other words, the model (5) 
will generally be misspecified. As we shall see, however, this is not problematic. 

Misspecified or not, there is nothing preventing calculation of likelihoods for the extended 
model (5). Suppose given a realisation (x n ,y n ) of length n sampled from the joint process 
(X,Y). The likelihood of 9 given this realisation is defined as 

<C(0\x n ,y n )=p(x n ,y n ;0), (6) 

i.e. the pdf of the joint history (X n , Y n ) under the model assumption with parameter vector 
9, and the average log-likelihood estimator is defined to be 

£(9\x n ,y n ) = J—log/:(e\x n ,y n ) (7) 
n — k 

(the factor of n — k is due to the effective loss of k samples due to lags). We have by Bayes' 
theorem 

C(0\x n , y n ) = p(x n , y n \x n -\ y n ~ l ; e)p{x n ~\ y"" 1 ; 6) 
= f(x n \x k n , y k n - 6)q(y n ) C(6\x n -\ y^ 1 ) 

from (3), leading to 

n 

£(0\x n ,y n )=p(x k ,y k ) [] [f(x t \x k , y k ; 0)q(y t )} (8) 

t=k+l 

where p(x k ,y k ), the initial joint distribution, is assumed given independently of 9. Noting 
that the q(yt) do not reference the parameter vector 9, we see that up to an additive factor 
not depending on 9 — which does not affect calculation of either maximum likelihood (ML) 
estimates or likelihood ratios — and assuming q{y) ^ almost everywhere (which in fact 
follows from the ergodic assumption introduced below), the average log-likelihood estimator 
is just 

1 n 

i(9\x n ,y n ) = r V logf(x t \x k ,yt,0). (9) 

n — k ^-^ 

t=k+l 

In practice this expression may be used to estimate an appropriate model order (i.e. number of 
lags k) via standard likelihood-based techniques such as the Akaike or Bayesian Information 
Criteria [22]. 

3 In fact, as regards the subsequent argument, the marginal Y t pdf q(y) in (5) could be replaced by an 
arbitrary pdf q(y) with q(y) ^ almost everywhere. 
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We now make the following ergodic assumption: 



The process U is ergodic, where U t = (X t -k, ■ ■ ■ , X t , Y t -k, • • • , Yt-i) ■ (10) 
The Birkhoff-Khinchin ergodic theorem [5] then applies, so that 4 

£(9\X n ,Y n ) E(logf(X t \X k ,Y k ;0)) (11) 



as n — > oo, again assuming that the expectation exists (at least for almost all 9). In particular, 
for the true parameter 9* it follows from (4) and the definition (1) that 

Proposition 1. 

£(9*\X n ,Y n ) -n(x t \X k ,Y k ^) (12) 

as n —> oo. □ 

Proposition 1 encapsulates the relationship between average log-likelihood and conditional 
entropy for a Markov model, which is key to our main result. Now by Gibbs' Inequality, we 
have E(logf(X t \X k ,Y t k ;0)) < E(logp(X t \Xj?,Y t k )) for all 9, with equality iff f(-\-,-;0) = 
p(-|-, ■) almost everywhere. By uniqueness of 9*, 9 = 9* maximises £{9\x n ,y n ) almost surely 
in the limit n — > oo and we have 



Proposition 2. 

0(X n ,Y n ) ^ 0* (13) 



as n — > oo, where 6(X n ,Y n ) is the ML estimator for the extended model (5). □ 

Thus, notwithstanding that the extended model may be misspecified, its ML estimator 
6(X n , Y n ) is nonetheless a consistent estimator for the true value 6* of the parameter vector 
for the partial model (3). In particular, Proposition 1 holds with 9* replaced by 9(X n ,Y n ). 

We now define a nested null model, with the object of testing the null hypothesis that 
the TE (2) is zero. Assuming the partial model (3) with parameter vector 9, it is clear that 
Ty^x = iff f(xt\x k ,y k ;9) does not depend on y k . Accordingly, we define the null set 
Bo ' Bl'.v 

G = {9 e G | f(x t \x k , y k ; 9) does not depend on y k } . (14) 

We assume that Go 7^ 0. Given a realisation (x n ,y n ) of the joint process (X , Y), the likelihood 
ratio test statistic for the null hypothesis Hq : 9 £ Qq is 

V ' C(9 \x n ,y n ) V ' 

where 9 and 9q are ML estimators for 9 over the full parameter set G and the null subset Go 
respectively. The following Theorem justifies defining the model TE estimator 

f Y ^x{x n ,y n ) = ^rlogACa^y"). (16) 

n — k 

Theorem 1. 



denotes almost sure convergence. 
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(a) fy^x (X n , Y n ) Ty^x asn^oo. 



(b) If Ty^x = 0, then 1(n — k) Ty-^x (X n , Y n ) has an asymptotic x 2 (d) distribution, 
where the number of degrees of freedom d is the difference between the number of pa- 
rameters in the full and null models. If Ty^x > 0, the asymptotic distribution is 
non-central x 2 (d; A) with non-centrality parameter A = 2(n — k) Ty^x- D 

Theorem la follows immediately from Propositions 1,2 and definitions (2) and (7); the esti- 
mator (16) is thus consistent, although it will generally be biased. Theorem lb follows from 
the standard large-sample theory [30] (but note that convergence to the non-central x 2 wm 
generally be slower than in the null case). It enables significance testing of the null hypothe- 
sis of zero TE, and the construction of confidence intervals for the estimator if the sample is 
sufficiently large. 

We note the ergodicity condition (10) restricts the class of models for which our analysis 
applies, although the restriction may not be that stringent. For example, for a (possibly 
nonlinear) VAR process of the form Xt = g(X k ) + et with iid residuals et, the joint process 
(X t -£, . . . , Xt) will be ergodic for any lag t if (roughly) for any region 1Z C Sx, P(e* £ H) = 
==$■ TZ has measure zero [7]. (This will be the case, for instance, if the residuals are 
nondegenerate multivariate Gaussian.) A discrete- valued Markov processes is ergodic if every 
state is aperiodic and positive recurrent [7]. 

For a linear VAR partial model of the form Xt = Yli=i AiXt~i + Yli=i BiXt—i + Et with iid 
multivariate Gaussian residuals et with covariance matrix £ and regression coefficient matrices 
Ai,Bi, the ergodicity condition is met provided the generalised variance [4] |S| is > 0. The 
parameter vector is 6 = (Ai,Bi,T,) and the null set Go is given by B\ = . . . = B\~ = 0. 
The ML is (up to a factor) |E|~A"~ fc )/ 2 where S is the ML estimator for X, which we note is 
asymptotically equivalent to the conventional OLS estimator [13]. The TE estimator (16) is 
then just half the Granger causality from Y to X [ ], and we recover the result of [•'>]. 

Given a third process Z jointly distributed with (X, Y), the effect of Z on the information 
flow Y — > X may be "conditioned out" by defining the conditional transfer entropy [25, 9 

T Y ->x\z^il(Xt\X?,Z?)-Il(x t \X?,Y t k ,Z*) . (17) 

It may be verified that Theorem 1 extends to the conditional case for partial Markov models 
of the form 

pfalx*- 1 , y*" 1 , z l - x - 6) = f{x t \x k u yl z*" 1 ; 6) . (18) 

In the discrete case, the various densities p(- ■ ■ ) are actual probabilities. We may calculate 
from (9) [for notational compactness we suppress explicit dependence on a realisation (x 11 , y n )] 
that 

i(e)= m,tt,v^ogfmtvt,o), (19) 

>*Jt 

where 

_. n t t—1 

m,ii^)= — - n 5 &> x J n %»»«) (20) 

ft K 

t=k+l s=t—k u=t—k 

are plug-in estimates for the densities p(xt,x k ,y k ) (by the ergodic assumption they are are 
consistent). Eq. (19) then furnishes a direct route to calculation of the maximum-likelihood 
estimator 6 and thence the TE estimator (16). 
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In the absence of an explicit model, the conventional non-parametric estimator for TE in 
the discrete case is the plug-in estimator 



(with the obvious notation). Now in the case of a finite state space, we can consider the 
transition probabilities p(£t\£t ,Vt) themselves as model parameters; i.e. the in a partial 
Markov model (3). Then it is clear that the p{£t\£tiVt) wm be ML estimators, and the 
plug-in estimator (21) is just the model estimator (16). If the cardinalities of the state spaces 
Sxi Sy are a, b respectively, then there are (a—l)a k b k independent parameters for the non-null 
model and (a — l)a k for the null model. Theorem lb then states that the plug-in estimator 
has an asymptotic x 2 distribution with (a — l)a k (b k — 1) degrees of freedom. Note that this 
scales polynomially with state space size and exponentially with model order. 

This is illustrated in the following example of a stationary, bivariate first-order Markov 
chain on the binary state space {0, 1} x {0, 1}. Although manifestly a toy model, we present 
this example partly for its close resemblance to a canonical minimal model analysed in [16], 
but mostly as it illustrates nicely the asymptotic convergence of the TE estimator to the 
theoretical x 2 distribution. For t = 1,2, .. ., £t and r\t are iid B(^) random variables 5 and 
Ut,vt are iid B(0),B(p) respectively. We then define 



X t = UfYt-i + (1 - u t )st 
Y t = v t X t -\ + (1 - v t )rj t . 



(22) 



It is clear that the ergodic requirement is satisfied provided ft < 1 and <p < 1. We may 
calculate 

p{xt,yt\x t -i,yt-i) = p(x t \xt-i,yt-i)q(yt\xt-i,yt-i) (23) 



where the conditional marginals are 

,_,) = ftS(.r,,.m^) + id -ft) 

(24) 



p(xt\x t -i,yt-i) = 0S(x t ,yt-i) + |(1 - 0) 
q(y t \x t -i,y t -i) = ipS(y t ,x t -i) + |(1 - if) • 



Note that in this case the extended model (5) is indeed misspecified, since from (22) it can 
be seen that Yt is not independent of Xt conditioned on joint history; the joint conditional 
pdf (23) does not factor according to (5). 

By the x f-> y,ft ip symmetry, it follows that the stationary joint density is uniform 
p(x, y) = j for all x, y, and we may calculate 

Ty^x = |(l + 0) log(l + 6) + 1(1 - ft) log(l - ft) . (25) 

By Theorem 1 the TE plug-in estimator (21) in this case has an asymptotic x 2 (2) distribu- 
tion under the null hypothesis of zero TE, and a non-central x 2 (2) distribution under the 
alternative hypothesis. FIG. 1 plots empirical vs. theoretical x 2 (2) cumulative distributions 
(cdfs) of f p Y ^x for a null model (ft = 0, FIG. 1A) and a non-null model (ft = 0.4, FIG. IB), 
estimated from 10 5 realisations each of a selection of sequence lengths. The (p parameter was 
set to 0.6. We see that the null distribution converges more quickly to its x 2 (2) asymptote 

5 B(p) denotes a Bernoulli trial taking the value 1 with probability p. 
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Figure 1: Cumulative distribution functions (cdfs) for the plug-in TE estimator (21) and 
corresponding scaled x 2 {2) theoretical asymptotic distributions (bold lines), for A a null model 
(8 = 0) and B a non-null model (0 = 0.4) for the binary Markov process (22) at a series of sequence 
lengths n. The vertical arrow marks the actual TE value Ty^x ~ 0.0823. 

than the non-null distribution. 

With regard to application of the finite-state plug-in estimator result, we note that it is 
not uncommon to discretise continuous time series data for causal analysis, sometimes after 
a differencing step. A more sophisticated discretisation approach is the recently introduced 
symbolic transfer entropy [27], currently gaining in popularity as a technique for inference of 
time-directed information flow applicable to systems of continuous variables. 

A recognised problem with transfer entropy estimation in the non-parametric case is the 
so-called "curse of dimensionality" [ ] , in particular with regard to scaling of degrees of free- 
dom with number of lags. Thus for continuous processes, naive coarse- graining and binning 
of data points is likely to be extremely inefficient in practice, yielding large estimation errors; 
worse, estimators derived in this fashion may not even converge monotonically to the correct 
value [16]. There are several approaches to mitigation of this problem, including adaptive par- 
titioning, kernel density methods [16] and k- nearest neighbour statistics [ 8]. An interesting 
avenue of research is whether any of these methods might be framed in a parametric context 
for which our result applies, thus yielding a useful x 2 sampling distribution. Another promis- 
ing approach is proposed in [24], where transfer entropy is decomposed into contributions 
of individual lags and then conditional dependencies reduced using methods from graphical 
model theory [19]. Again, it would be of interest to investigate whether a similar dimensional 
reduction could be achieved within a model-based ML framework. 

In a parametric context, it might be argued that dimensionality is a lesser curse: prior 
to parametric TE estimation an empirical model order should be selected in accordance with 
the size of the available sample, most conveniently, as noted previously, via the average log- 
likelihood estimator (9). In practice this will limit the dimensionality of the parameter space 
and will frequently yield tractable model orders, resulting in acceptably efficient estimators. 
Thus for example, Granger causality has been effectively applied (especially in the neuro- 
sciences) for highly multivariate data. For discrete systems, and in particularly for symbolic 
transfer entropy, it seems not unreasonable to expect similar, although further research is 
required. 

The special case of Granger causality suggests a useful range of applications for Theo- 
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rem 1. Linear VAR modelling seems sometimes to be taken as a convenient "one size fits all" 
approach, insofar as (almost) any wide-sense stationary process may be modelled as a linear 
VAR, albeit of potentially high model order [13, 8]. However, this is certainly not to say that 
a linear VAR model will necessarily be a good model for given time series data. In particular, 
high empirical model orders (as indicated by model selection criteria) may well be indicative 
of failure of a model to reflect parsimoniously the statistical structure of the data. Physical 
considerations or supplementary analysis may suggest alternative models. Thus for example 
a nonlinear VAR approach [2, 6] might be suspected on physical grounds, heteroscedasticity 
may suggest a GARCH model [2 J] or long-term memory a fractional ARIMA model [12, 15]. 
Now it may be far from clear how one should define a Granger-like predictive statistic for 
such models. Theorem 1, and its implied equivalence (at least in the Gaussian case) with 
transer entropy, suggests that the principled generalisation of Granger causality is just the 
model transfer entropy, operationalised as a log-likelihood ratio - and in addition furnishes 
an asymptotic sampling distribution. 

In summary, our result unifies the established machinery of maximum likelihood and 
the concomitant large-sample theory with the information-theoretic notion of directed in- 
formation transfer. Given a — perhaps physically motivated — parametric model (or, if the 
state space is discrete, an implicit Markov chain model) it thereby facilitates estimation and 
statistical inference of transfer entropy. In particular it furnishes generic asymptotic x 2 dis- 
tributions for significance testing and estimation of confidence intervals, obviating the need 
for surrogate/subsampling methods. 

Finally, current research by the authors suggests an extension of the result to point pro- 
cesses, with application to inference of information flow in spiking neural systems [23, 17]. 

Author Barnett's work has been supported by the Dr. Mortimer and Theresa Sackler 
Foundation via the Sackler Centre for Consciousness Science, and a visiting fellowship at the 
Centre for Research in Complex Systems. 
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